在實際的資料分析中,通常需同時考量多個自變項對依變項的影響,自變項的資料型態可以是連續型的或類別型的,藉此建立較符合實際情況的羅吉斯迴歸模式,當自變數超過一個以上,則稱為多元羅吉斯迴歸分析(multivariate logistic regression),又可稱為多元或複羅吉斯迴歸分析。


一、多元羅吉斯迴歸模型 (Multivariate Logistic Regression Models)


範例:河盲症 (Oncho_ems)


資料背景:

針對河盲症研究何種因子與感染風險有關。


編碼簿:

A data frame with 1302 observations on the following 7 variables. (individual data)

變數名稱 變數涵義 資料編碼方式 資料型態
id subject unique id number 類別
mf Microfiliarial infection 0 = no; 1 = yes 類別
area Area of residence 0 = Savannah;
1 = Rainforest
類別
agegrp age group 0 = 5-9; 1 = 10-19; 2 = 20-39; 3 = ≥40 類別
sex gender 0 = male; 1 = female 類別
mfload No. of mf in skin snip 0 = none; 1 = 1-9; 2 = 10-49; 3 = ≥50 類別
lesions Presence of any eye lesions 0 = no; 1 = yes 類別


練習題:

Part 1 建立多元羅吉斯迴歸模型

  1. 探討區域對於微絲蟲感染之影響?

  2. 探討年齡對於微絲蟲感染之影響?

  3. 探討性別對於微絲蟲感染之影響?

  4. 在調整區域性與年齡層後,微絲蟲感染與性別之關係?


Part 2 ROC 曲線





參考解答:


匯入資料,並查看資料
# 匯入資料
# getwd() #查看目前所在資料夾位置
# setwd('PATH/') # win: 'D:/' or 'C:/'; mac: '~/Desktop/'
dat <- read.csv('Oncho_ems.csv', header = TRUE, stringsAsFactors = TRUE)

# 查看前六筆資料
head(dat)
##   id mf area agegrp sex mfload lesions
## 1  1  1    0      2   1      1       0
## 2  2  1    1      3   0      3       0
## 3  3  1    0      3   1      1       0
## 4  4  0    1      2   1      0       0
## 5  5  0    0      3   1      0       0
## 6  6  0    1      2   1      0       0
# 查看資料大小
dim(dat)
## [1] 1302    7

這個資料有7欄,1302橫列。


# 查看變數資料型態
str(dat)
## 'data.frame':    1302 obs. of  7 variables:
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mf     : int  1 1 1 0 0 0 1 1 1 1 ...
##  $ area   : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ agegrp : int  2 3 3 2 3 2 3 2 3 3 ...
##  $ sex    : int  1 0 1 1 1 1 0 1 0 1 ...
##  $ mfload : int  1 3 1 0 0 0 1 1 1 3 ...
##  $ lesions: int  0 0 0 0 0 0 0 0 0 1 ...


由編碼簿可以知道所有變數應該都要是類別型

# 轉換變數型態
dat$id = as.factor(dat$id)
dat$mf = as.factor(dat$mf)
dat$area = as.factor(dat$area)
dat$agegrp = as.factor(dat$agegrp)
dat$sex = as.factor(dat$sex)
dat$mfload = as.factor(dat$mfload)
dat$lesions = as.factor(dat$lesions)


# 再次查看變數資料型態
str(dat)
## 'data.frame':    1302 obs. of  7 variables:
##  $ id     : Factor w/ 1302 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ mf     : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 2 2 2 ...
##  $ area   : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ agegrp : Factor w/ 4 levels "0","1","2","3": 3 4 4 3 4 3 4 3 4 4 ...
##  $ sex    : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 2 1 2 ...
##  $ mfload : Factor w/ 4 levels "0","1","2","3": 2 4 2 1 1 1 2 2 2 4 ...
##  $ lesions: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...


Part 1 建立多元羅吉斯迴歸模型


1. 探討區域對於微絲蟲感染之影響?
lm3 = glm(mf ~ area, data = dat, family = binomial(link = "logit"))
summary(lm3)
## 
## Call:
## glm(formula = mf ~ area, family = binomial(link = "logit"), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5900  -1.1992   0.8148   0.8148   1.1558  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.05111    0.08546   0.598     0.55    
## area1        0.88102    0.11767   7.487 7.05e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1714.1  on 1301  degrees of freedom
## Residual deviance: 1657.0  on 1300  degrees of freedom
## AIC: 1661
## 
## Number of Fisher Scoring iterations: 4


# Exponentiated coefficients (odds ratios)
exp(coef(lm3))
## (Intercept)       area1 
##    1.052434    2.413363
# 2.413363是area的勝算比(odds ratio)


根據summary(lm3),可以將模式寫成以下方程式:

\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = 0.051 + 0.881area\)

上列迴歸方程式可解釋為住在Rainforest地區的人被微絲蟲感染的勝算(odds)是住在Savannah地區的2.413(\(e^{0.881}\))倍,因此可以知道住在Rainforest地區的人會比住在Savannah地區更容易被感染。

(area: Area of residence (0 savannah, 1 forest))


\(7.05 \times 10^{-14}\)為檢定\(H_0: \beta_{area}=0\)\(p\)值,p-value = \(7.05 \times 10^{-14} < 0.05\),故在此模型中area的影響在統計上是顯著的。



2. 探討年齡對於微絲蟲感染之影響?
lm4 = glm(mf ~ agegrp, data = dat, family = binomial(link = "logit"))
summary(lm4)
## 
## Call:
## glm(formula = mf ~ agegrp, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8681  -0.7189   0.6196   0.8358   1.7203  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2212     0.1678  -7.279 3.37e-13 ***
## agegrp1       1.0372     0.2160   4.802 1.57e-06 ***
## agegrp2       2.0933     0.1987  10.534  < 2e-16 ***
## agegrp3       2.7741     0.2081  13.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1714.1  on 1301  degrees of freedom
## Residual deviance: 1455.7  on 1298  degrees of freedom
## AIC: 1463.7
## 
## Number of Fisher Scoring iterations: 4


# Exponentiated coefficients (odds ratios)
exp(coef(lm4))
## (Intercept)     agegrp1     agegrp2     agegrp3 
##   0.2948718   2.8213372   8.1120000  16.0239130


根據summary(lm4),可以將模式寫成以下方程式:

\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.221 + 1.037agegrp1 + 2.093agegrp2 + 2.774agegrp3\)

上列迴歸方程式可解釋為年齡層在10-19歲的人被微絲蟲感染的勝算是年齡層在5-9歲的2.821(\(e^{1.037}\))倍; 年齡層在20-39歲的人被微絲蟲感染的勝算是年齡層在5-9歲的8.112(\(e^{2.093}\))倍; 年齡層在>=40歲的人被微絲蟲感染的勝算是年齡層在5-9歲的16.024(\(e^{2.774}\))倍。 因此可以知道年紀越大的人越容易被感染。

(agegrp: Age group (0: 5-9, 1: 10-19, 2: 20-39, 3: >=40))



另外,可以藉由log likelihood ratio test執行以下檢定:

\(H_{0}: \beta_{agegrp1} = \beta_{agegrp2} = \beta_{agegrp3} = 0\)

\(H_{A}: \beta_{agegrp1} \neq 0 \text{ or } \beta_{agegrp2} \neq 0 \text{ or } \beta_{agegrp3} \neq 0\)

(註:也就是檢定在這個模型裡,agegrp變數整體而言是否為顯著的變數)

library(car)
## Loading required package: carData
Anova(lm4, test = "LR", type = "II")
## Analysis of Deviance Table (Type II tests)
## 
## Response: mf
##        LR Chisq Df Pr(>Chisq)    
## agegrp    258.4  3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

G=258.4,p-value(< 2.2e-16) <0.05,因此agegrp這個變數在此模型裡是個顯著的變數。



3. 探討性別對於微絲蟲感染之影響?
lm6 = glm(mf ~ sex, data = dat, family = binomial(link = "logit"))
summary(lm6)
## 
## Call:
## glm(formula = mf ~ sex, family = binomial(link = "logit"), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5338  -1.3122   0.8589   1.0483   1.0483  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80742    0.08724   9.255  < 2e-16 ***
## sex1        -0.49588    0.11655  -4.255 2.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1714.1  on 1301  degrees of freedom
## Residual deviance: 1695.7  on 1300  degrees of freedom
## AIC: 1699.7
## 
## Number of Fisher Scoring iterations: 4


# Exponentiated coefficients (odds ratios)
exp(coef(lm6))
## (Intercept)        sex1 
##   2.2421053   0.6090335


根據summary(lm6),可以將模式寫成以下方程式:

\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = 0.807 - 0.496sex\)

上列迴歸方程式可解釋為女生被感染的勝算是男生的0.609(\(e^{-0.496}\))倍,因此可以知道男生比女生更容易被微絲蟲感染。

(sex: Sex (0: male, 1: female))


\(2.09 \times 10^{-5}\)為檢定\(H_0: \beta_{sex}=0\)\(p\)值,p-value < 0.05,故在此模型中sex的影響是顯著的。



4. 在調整區域性與年齡層後,微絲蟲感染與性別之關係?
lm7 = glm(mf ~ sex + agegrp + area, data = dat, family = binomial(link = "logit"))
summary(lm7)
## 
## Call:
## glm(formula = mf ~ sex + agegrp + area, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2212  -0.7666   0.5392   0.7081   2.1459  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.6157     0.2066  -7.820 5.30e-15 ***
## sex1         -0.5813     0.1356  -4.286 1.82e-05 ***
## agegrp1       0.9429     0.2239   4.211 2.54e-05 ***
## agegrp2       2.3478     0.2108  11.138  < 2e-16 ***
## agegrp3       2.8713     0.2171  13.225  < 2e-16 ***
## area1         1.1227     0.1386   8.099 5.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1714.1  on 1301  degrees of freedom
## Residual deviance: 1366.1  on 1296  degrees of freedom
## AIC: 1378.1
## 
## Number of Fisher Scoring iterations: 4


# Exponentiated coefficients (odds ratios)
exp(coef(lm7))
## (Intercept)        sex1     agegrp1     agegrp2     agegrp3       area1 
##   0.1987542   0.5591696   2.5673568  10.4623748  17.6593452   3.0731378

根據summary(lm7),可以將模式寫成以下方程式:

\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581sex + 0.943agegrp1 + 2.348agegrp2 + 2.871agegrp3 + 1.123area\)

上列迴歸方程式可解釋為住在同一地區且年齡層相同的人,女性被感染的勝算是男性的0.559(\(e^{-0.581}\))倍;因此可以知道對於住在同一地區且年齡層相同的人而言,男性比女性更容易被微絲蟲感染。




Part 2 ROC曲線


lm7為例


安裝pROC package
# install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.5
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var


計算Prob(被微絲蟲感染)的預測值(pre)
dat$pre = predict(lm7, dat, type = "response")
head(dat)
##   id mf area agegrp sex mfload lesions       pre
## 1  1  1    0      2   1      1       0 0.5376278
## 2  2  1    1      3   0      3       0 0.9151558
## 3  3  1    0      3   1      1       0 0.6624600
## 4  4  0    1      2   1      0       0 0.7813405
## 5  5  0    0      3   1      0       0 0.6624600
## 6  6  0    1      2   1      0       0 0.7813405

這是剛剛在part 1估計出來的lm7模式:

\(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581sex + 0.943agegrp1 + 2.348agegrp2 + 2.871agegrp3 + 1.123area\)


代入數值即可求得pre,例如:

id 1 : \(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581 \times 1 + 0.943 \times 0 + 2.348 \times 1 + 2.871 \times 0 + 1.123 \times 0 = 0.151\)

\(\hat{p}_{被微絲蟲感染} = \frac{1}{1 + e^{-0.151}} = 0.5376784\)


id 2 : \(logit(\hat{p}_{被微絲蟲感染}) = log(\frac{\hat{p}_{被微絲蟲感染}}{1-\hat{p}_{被微絲蟲感染}}) = -1.616 - 0.581 \times 0 + 0.943 \times 0 + 2.348 \times 0 + 2.871 \times 1 + 1.123 \times 1 = 2.378\)

\(\hat{p}_{被微絲蟲感染} = \frac{1}{1 + e^{-2.378}} = 0.9151\)




ROC曲線
rocobj = roc(dat$mf, dat$pre) # roc(實際觀察到的結果, 藉由模式(lm7)預測出來的結果)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases


plot(rocobj, 
     print.thres = TRUE,  # print.thres = TRUE: the threshold with the highest sum sensitivity + specificity is plotted 
     grid = TRUE)

rocobj$auc
## Area under the curve: 0.7924
ci.auc(rocobj)
## 95% CI: 0.7669-0.8179 (DeLong)

可得0.669為最佳切點,此時的Sensitivity:0.680,Specificity:0.771。

AUC = 0.7924 (0.7669, 0.8179)

(註: 最佳切點的判斷依據是希望 sensitivity + specificity 有最大值)


在不同切點(cutoff value, threshold)下的sensitivity和specificity:

data.frame(threshold = rocobj$thresholds, 
           sensitivity = rocobj$sensitivities, 
           specificity = rocobj$specificities)
##    threshold sensitivity specificity
## 1       -Inf   1.0000000  0.00000000
## 2  0.1329109   0.9939173  0.08541667
## 3  0.1938949   0.9805353  0.16041667
## 4  0.2382886   0.9708029  0.23125000
## 5  0.2962281   0.9513382  0.32708333
## 6  0.3585290   0.9343066  0.36041667
## 7  0.4231920   0.9172749  0.42916667
## 8  0.5024109   0.8819951  0.50833333
## 9  0.5741204   0.8004866  0.63958333
## 10 0.6365365   0.7420925  0.70416667
## 11 0.6688628   0.6800487  0.77083333
## 12 0.7267649   0.6119221  0.81666667
## 13 0.7798023   0.5279805  0.86458333
## 14 0.8195604   0.4111922  0.92291667
## 15 0.8612349   0.2603406  0.93958333
## 16 0.8899226   0.1630170  0.96458333
## 17       Inf   0.0000000  1.00000000

可以利用曲線下的面積(AUC)來判別ROC曲線的鑑別力,AUC 數值的範圍從0到1,數值愈大愈好。以下為AUC 數值一般的判別規則:

  • AUC=0.5 (無鑑別力)

  • 0.7≦AUC≦0.8 (可接受的鑑別力)

  • 0.8≦AUC≦0.9 (優良的鑑別力)

  • 0.9≦AUC≦1.0 (極佳的鑑別力)



補充


[觀察值(data) → 建立logistic regression model → 預測值pre → cutoff point → sensitivity & specificity → ROC曲線]


當cutoff point = 0.669時,如果Prob(被微絲蟲感染)的預測值pre>0.669,這個人就會被判定成「會被感染」,反之,若pre<0.669,則判定為「不會被感染」

out = ifelse(dat$pre > 0.669, 1, 0)


接著,把實際上觀察到的結果dat$mf,和剛剛透過切點(0.669)判斷出來的結果out,計算各種可能組合的次數,可得下面這個列連表:

addmargins(table(mf = dat$mf, predict = out))
##      predict
## mf       0    1  Sum
##   0    370  110  480
##   1    263  559  822
##   Sum  633  669 1302

這時候就可以算sensitivity和specificity了!

sensitivity = 559/822 = 0.6800487

specificity = 370/480 = 0.7708333



二、練習


範例: 前列腺癌 (pros)


資料背景:

針對前列腺癌研究何種指標能夠有效進行疾病預測。


編碼簿:

A data frame with 380 observations on the following 9 variables.

變數名稱 變數涵義 資料編碼方式 資料型態
ID Identification code 類別
CAPSULE Tumor penetration of prostatic capsule 0 = No penetration;
1 = Penetration
類別
AGE Age in years 連續
RACE Race 1 = White; 2 = Black 類別
DPROS Results of the digital rectal exam 1 = No nodule;
2 = Unilobar nodule (left);
3 = Unilobar nodule (right);
4 = Bilobar nodule (left)
類別
DCAPS Detection of capsular involvement in rectal exam 1 = No; 2 = Yes 類別
PSA Prostatitis specific antigen value (mg/ml) 連續
VOL Tumor volume obtained from ultrasound (cm3) 連續
GLEASON Total gleason score 連續


練習題:

  1. 以DPROS 為解釋變數進行羅吉斯迴歸模式分析。請列出迴歸係數,並說明DPROS對於CAPSULE的影響(計算OR,並以OR 的角度解釋)。請以likelihood ratio test 檢定DPROS 變數是否顯著。

  2. 以PSA為解釋變數進行羅吉斯迴歸模式分析。請列出迴歸係數,並說明PSA對於CAPSULE的影響(計算OR,並以OR 的角度解釋)。

  3. 以DPROS, PSA與DCAPS為解釋變數進行羅吉斯迴歸模式分析。請列出迴歸係數,並說明各變數對於CAPSULE的影響(計算OR,並以OR 的角度解釋)。請以likelihood ratio test檢定變數是否顯著。請畫出此模型的ROC曲線與計算ROC曲線下的面積,說明其最佳cutoff value 的sensitivity與specificity為何?並解釋此ROC曲線。